A data set containing 1599 observations on 11 attributes of wine and wine quality is explored to find the composition of attributes resulting in higher quality wines. The attributes with the highest correlation to wine quality are: alcohol, volatile acidity, sulphates, and citric acid. However, it may be composition consistency that most affects wine quality.
## [1] 1599 13
## 'data.frame': 1599 obs. of 13 variables:
## $ X : int 1 2 3 4 5 6 7 8 9 10 ...
## $ fixed.acidity : num 7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
## $ volatile.acidity : num 0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
## $ citric.acid : num 0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
## $ residual.sugar : num 1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
## $ chlorides : num 0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
## $ free.sulfur.dioxide : num 11 25 15 17 11 13 15 15 9 17 ...
## $ total.sulfur.dioxide: num 34 67 54 60 34 40 59 21 18 102 ...
## $ density : num 0.998 0.997 0.997 0.998 0.998 ...
## $ pH : num 3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
## $ sulphates : num 0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
## $ alcohol : num 9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
## $ quality : int 5 5 5 6 5 5 5 7 7 5 ...
## X fixed.acidity volatile.acidity citric.acid
## Min. : 1.0 Min. : 4.60 Min. :0.1200 Min. :0.000
## 1st Qu.: 400.5 1st Qu.: 7.10 1st Qu.:0.3900 1st Qu.:0.090
## Median : 800.0 Median : 7.90 Median :0.5200 Median :0.260
## Mean : 800.0 Mean : 8.32 Mean :0.5278 Mean :0.271
## 3rd Qu.:1199.5 3rd Qu.: 9.20 3rd Qu.:0.6400 3rd Qu.:0.420
## Max. :1599.0 Max. :15.90 Max. :1.5800 Max. :1.000
## residual.sugar chlorides free.sulfur.dioxide
## Min. : 0.900 Min. :0.01200 Min. : 1.00
## 1st Qu.: 1.900 1st Qu.:0.07000 1st Qu.: 7.00
## Median : 2.200 Median :0.07900 Median :14.00
## Mean : 2.539 Mean :0.08747 Mean :15.87
## 3rd Qu.: 2.600 3rd Qu.:0.09000 3rd Qu.:21.00
## Max. :15.500 Max. :0.61100 Max. :72.00
## total.sulfur.dioxide density pH sulphates
## Min. : 6.00 Min. :0.9901 Min. :2.740 Min. :0.3300
## 1st Qu.: 22.00 1st Qu.:0.9956 1st Qu.:3.210 1st Qu.:0.5500
## Median : 38.00 Median :0.9968 Median :3.310 Median :0.6200
## Mean : 46.47 Mean :0.9967 Mean :3.311 Mean :0.6581
## 3rd Qu.: 62.00 3rd Qu.:0.9978 3rd Qu.:3.400 3rd Qu.:0.7300
## Max. :289.00 Max. :1.0037 Max. :4.010 Max. :2.0000
## alcohol quality
## Min. : 8.40 Min. :3.000
## 1st Qu.: 9.50 1st Qu.:5.000
## Median :10.20 Median :6.000
## Mean :10.42 Mean :5.636
## 3rd Qu.:11.10 3rd Qu.:6.000
## Max. :14.90 Max. :8.000
The Wine dataset contains 1599 observations on 13 variables of red wine: a row index (X), 11 physicochemical attributes (fixed acidity, volatile acidity, citric acid, residual sugar, chlorides, free sulfur dioxide, total sulfur dioxide, density, pH, sulphates, and alcohol), and a wine quality rating. The data was originally collected by Cortez et al. (2009) on Portuguese “Vinho Verde” red wine. While the attributes were obtained through objective physicochemical tests, the wine quality was obtained through expert ratings (the median rating of at least three wine experts). In this data set, the wine quality and the factors that lead to a higher quality wine are explored. Combinations of wine attributes are then used to build a predictive model of wine quality.
Firstly, the wine variables are investigated individually to determine the patterns of each variable. Since there are a number of variables, a matrix plot is created to show an overview of the data and visualize which variables have interesting patterns to further investigate.
According to the matrix plot, a few of the variables have interesting features for further investigation. For example, the citric acid variable has no clear Gaussian distribution, the alcohol content is right-skewed, and the wine quality is bimodal. In addition, several variables appear to be correlated with wine quality: alcohol (r = 0.50, p < 0.001), volatile acidity (r = -0.40, p < 0.001), citric acid (r = 0.23, p < 0.001), and sulphates (r = 0.22, p < 0.001). These variables will be further investigated.
From the initial citric acid histogram, the trends are difficult to decipher. However, after adjusting the binwidth, a pattern of intermittent spikes in citric acid becomes visible. The highest count is at 0, followed by peaks just before 0.25 and 0.50 g/dm^3.
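The effect of bin width can be sketched in base R. This is a minimal illustration on synthetic values: the `citric` vector below is made up to resemble two-decimal measurements with a few over-represented values, and `hist()` is used here as an assumption, not necessarily the plotting call behind the original figure.

```r
# Synthetic stand-in for wine$citric.acid (NOT the real data):
# values recorded to two decimals, with extra mass at 0, 0.24 and 0.49.
set.seed(1)
citric <- round(c(runif(300, 0, 0.8), rep(0, 40), rep(0.24, 20), rep(0.49, 25)), 2)

# Coarse bins (width 0.1) blend each spike with its neighbouring values.
coarse <- hist(citric, breaks = seq(0, 0.8, by = 0.1), plot = FALSE)

# Bins of width 0.01, centred on the recorded values, isolate each value.
fine <- hist(citric, breaks = seq(-0.005, 0.805, by = 0.01), plot = FALSE)

# Midpoints of the three fullest fine bins
top3 <- fine$mids[order(fine$counts, decreasing = TRUE)[1:3]]
print(sort(round(top3, 2)))
```

Choosing a bin width equal to the measurement resolution is what lets individual over-represented values stand out; wider bins average them away.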
length(unique(wine$citric.acid))
## [1] 80
sort(table(wine$citric.acid), decreasing = TRUE)
##
## 0 0.49 0.24 0.02 0.26 0.1 0.01 0.08 0.21 0.32 0.03 0.09 0.3 0.31 0.04
## 132 68 51 50 38 35 33 33 33 32 30 30 30 30 29
## 0.4 0.42 0.39 0.12 0.22 0.25 0.2 0.23 0.33 0.06 0.34 0.44 0.48 0.07 0.18
## 29 29 28 27 27 27 25 25 25 24 24 23 23 22 22
## 0.45 0.14 0.19 0.29 0.05 0.27 0.36 0.5 0.15 0.28 0.37 0.46 0.13 0.47 0.52
## 22 21 21 21 20 20 20 20 19 19 19 19 18 18 17
## 0.17 0.41 0.11 0.43 0.38 0.53 0.66 0.35 0.51 0.54 0.55 0.68 0.63 0.16 0.57
## 16 16 15 15 14 14 14 13 13 13 12 11 10 9 9
## 0.58 0.6 0.64 0.56 0.59 0.65 0.69 0.74 0.73 0.76 0.61 0.67 0.7 0.62 0.71
## 9 9 9 8 8 7 4 4 3 3 2 2 2 1 1
## 0.72 0.75 0.78 0.79 1
## 1 1 1 1 1
Counting the unique citric acid values shows that citric acid generally ranges from 0 to 0.79 g/dm^3, with a single outlier at 1.00 g/dm^3. The table confirms my earlier observation that there are spikes at 0.00, 0.24, and 0.49 g/dm^3 of citric acid.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 8.40 9.50 10.20 10.42 11.10 14.90
The alcohol content is right-skewed, with the highest peak at 9.50% alcohol. Half of the wines fall between 9.50% (first quartile) and 11.10% (third quartile). Many of the wines cluster around the 9.0-9.5% mark, which is not surprising, as wine is quoted to contain between 9% and 16% alcohol on average. It is surprising, however, that so many wines sit near the 9% lower bound while most fall short of the 12% that is typical of red wines (Jancis, 2006). It would be interesting to see how this affects the wine quality rating.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 3.000 5.000 6.000 5.636 6.000 8.000
According to the plot and summary, most of the wines are rated 5 or 6, with a mean of 5.64 and a median of 6.00. No wines are rated lower than 3 or higher than 8. I wonder how the absence of very low quality wines (quality 0-2) and very high quality wines (quality 9-10) will affect the correlations with quality.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 4.60 7.10 7.90 8.32 9.20 15.90
According to the histogram and summary, fixed acidity is roughly bell-shaped and slightly right-skewed, with the middle half of the wines having a fixed acidity between 7.10 g/dm^3 and 9.20 g/dm^3.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.1200 0.3900 0.5200 0.5278 0.6400 1.5800
Volatile acidity tends to range between 0.39 g/dm^3 and 0.64 g/dm^3.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.900 1.900 2.200 2.539 2.600 15.500
Residual sugar has an interesting curve, with most wines having between 1.90 g/dm^3 and 2.60 g/dm^3. However, a number of wines have much higher levels of sugar. This might indicate different types of wine that usually contain more sugar, which might bias the apparent effect on quality.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.01200 0.07000 0.07900 0.08747 0.09000 0.61100
Chlorides mostly range between 0.07 g/dm^3 and 0.09 g/dm^3. However, there are outliers above 0.60 g/dm^3. It would be interesting to see how these very high outliers affect quality.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.00 7.00 14.00 15.87 21.00 72.00
Free sulfur dioxide is right-skewed, with most values ranging between 7.0 mg/dm^3 and 21.0 mg/dm^3.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 6.00 22.00 38.00 46.47 62.00 289.00
Total sulfur dioxide shows an interesting trend with a floor limit. It is also very right-skewed.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.9901 0.9956 0.9968 0.9967 0.9978 1.0040
Density closely follows a Gaussian curve with very few outliers. This may indicate that density does not deviate much among wines.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.740 3.210 3.310 3.311 3.400 4.010
Similar to density, pH closely follows a Gaussian curve with few outliers. This may indicate that there is very little room for deviation in the physical properties of wine.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.3300 0.5500 0.6200 0.6581 0.7300 2.0000
Sulphates mostly range between 0.55 g/dm^3 and 0.73 g/dm^3, with a few outliers above 1.50 g/dm^3.
Most of the wine attributes displayed a Gaussian curve. Of these attributes, citric acid, alcohol, and quality had interesting trends. Citric acid showed intermittent peaks at 0, 0.24, and 0.49 g/dm^3. The alcohol content peaked at 9.5% and was right-skewed, even with a log10 transformation. The wine quality ratings were overwhelmingly 5 or 6.
To determine which attributes result in higher quality wines, the relationship between various wine attributes and quality is further explored. In the previous matrix plot, it was determined that alcohol, volatile acidity, citric acid, and sulphates have the highest correlation with quality, therefore they are further plotted and analyzed to find their contribution to quality. In addition, to represent the sulfur dioxide groups, total sulfur dioxide is also investigated.
##
## Pearson's product-moment correlation
##
## data: wine$alcohol and wine$quality
## t = 21.639, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.4373540 0.5132081
## sample estimates:
## cor
## 0.4761663
Due to the overplotting in the first alcohol vs quality scatter plot, it is difficult to discern whether there is a correlation between the two variables, or where values are repeated. To reduce the overplotting, I set the alpha value to 1/10 in the second plot. This made it easier to see the slight positive trend and the concentration of points around 10% alcohol and quality 5-6. I also added a linear model line to better show the relationship between alcohol content and quality rating (r = 0.48, p < 0.001). The correlation is only moderate, but the very low p-value indicates that it is statistically significant and unlikely to be due to chance.
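As a quick check on the figures above, the test statistic and the share of variance explained can be recomputed from the reported correlation and degrees of freedom alone. This is only a sketch of the arithmetic behind a Pearson correlation test; the variable names are illustrative.

```r
# Values reported in the correlation output above
r  <- 0.4761663
df <- 1597

# Test statistic for Pearson's r: t = r * sqrt(df) / sqrt(1 - r^2)
t_stat <- r * sqrt(df) / sqrt(1 - r^2)

# Two-sided p-value from the t distribution
p_val <- 2 * pt(-abs(t_stat), df)

# Share of the variance in quality that a straight line in alcohol accounts for
r_squared <- r^2

print(round(c(t = t_stat, r_squared = r_squared), 3))
```

The recomputed statistic matches the reported t value, and squaring r shows why a highly significant correlation can still be modest in practical terms: alcohol alone accounts for less than a quarter of the variance in quality.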
##
## Pearson's product-moment correlation
##
## data: wine$volatile.acidity and wine$quality
## t = -16.954, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.4313210 -0.3482032
## sample estimates:
## cor
## -0.3905578
The scatter plot with the linear model shows that there is a negative correlation between volatile acidity and wine quality. The Pearson’s correlation coefficient test confirms this with r = -0.39, p < 0.001. This is unsurprising since volatile acidity is essentially acetic acid in wine, which, according to Lopez et al. (2009), can lead to a vinegar-like taste.
The boxplot shows that as the quality of the wine increases, the level of volatile acidity decreases until it plateaus at 0.4 g/dm^3 between wine quality 7 and 8. To find the difference between the levels of volatile acidity between wines with quality 7 and 8, a subset summary is conducted.
with(subset(wine, quality == 7), summary(volatile.acidity))
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.1200 0.3000 0.3700 0.4039 0.4850 0.9150
with(subset(wine, quality == 8), summary(volatile.acidity))
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.2600 0.3350 0.3700 0.4233 0.4725 0.8500
sum(wine$quality == 7)
## [1] 199
sum(wine$quality == 8)
## [1] 18
According to the summary, wines with a quality of 8 surprisingly have a slightly higher mean volatile acidity (M = 0.42) than wines with quality 7 (M = 0.40), although the medians are identical (0.37). I wondered whether the slight increase indicates a polynomial relationship in which the highest quality wines begin to increase in volatile acidity. However, a count of the wines in each group shows only 18 wines with quality 8, which is too few to support a polynomial curve. All that can be suggested is a negative correlation between wine quality and volatile acidity that plateaus at around quality 7-8.
##
## Pearson's product-moment correlation
##
## data: wine$sulphates and wine$quality
## t = 10.38, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.2049011 0.2967610
## sample estimates:
## cor
## 0.2513971
The scatter plot with the linear model shows a positive correlation between sulphates and quality (r = 0.25, p < 0.001).
The boxplots indicate that there is an increase in the level of sulphates as quality increases. When zoomed in, the boxplot hints at a slight polynomial relationship, in which there are plateaus in sulphate levels between quality ratings 3-4 and 7-8. The level of sulphates increases much more rapidly between quality 5-7.
by(wine$sulphates, wine$quality, summary)
## wine$quality: 3
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.4000 0.5125 0.5450 0.5700 0.6150 0.8600
## --------------------------------------------------------
## wine$quality: 4
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.3300 0.4900 0.5600 0.5964 0.6000 2.0000
## --------------------------------------------------------
## wine$quality: 5
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.370 0.530 0.580 0.621 0.660 1.980
## --------------------------------------------------------
## wine$quality: 6
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.4000 0.5800 0.6400 0.6753 0.7500 1.9500
## --------------------------------------------------------
## wine$quality: 7
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.3900 0.6500 0.7400 0.7413 0.8300 1.3600
## --------------------------------------------------------
## wine$quality: 8
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.6300 0.6900 0.7400 0.7678 0.8200 1.1000
The summary analysis confirms my theory that there is a slight polynomial relationship between sulphates and quality, as the sulphate level increases much more rapidly between quality 5-7.
##
## Pearson's product-moment correlation
##
## data: wine$citric.acid and wine$quality
## t = 9.2875, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.1793415 0.2723711
## sample estimates:
## cor
## 0.2263725
The scatter plot with the linear model shows a positive correlation between citric acid and quality (r = 0.23, p < 0.001). This is unsurprising since citric acid “can add ‘freshness’ and flavor to wines” (Lopez et al., 2009).
The boxplot shows that for increasing level of quality rating, the level of citric acid also increases.
##
## Pearson's product-moment correlation
##
## data: wine$total.sulfur.dioxide and wine$quality
## t = -7.5271, df = 1597, p-value = 8.622e-14
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.2320162 -0.1373252
## sample estimates:
## cor
## -0.1851003
Although total sulfur dioxide has a negative correlation with quality (r = -0.19, p < 0.001), it is evident in the boxplot and line plot that this is not a linear relationship. There seem to be higher levels of total sulfur dioxide in mid-level quality wines and lower levels in both low and high quality wines. This may be explained by the fact that sulfur dioxide helps prevent microbial growth, but too much of it can lead to an overwhelming flavor. Perhaps low quality wines do not contain enough to prevent bad-tasting microbial growth, and it takes a good quality wine to be able to add only trace amounts while still controlling the microbial growth.
The attribute with the strongest correlation to quality is alcohol (positive), followed by volatile acidity (negative). Sulphates and citric acid have weaker positive correlations. Although total sulfur dioxide has a slight negative correlation with quality, the plots show that the relationship is non-linear, with higher levels of sulfur dioxide in mid-level quality wines.
To find the combination of attributes that contribute to wine quality, different attributes of wine are correlated with each other. This helps determine if an attribute is the reason for a higher wine quality or if there is a third variable, such as another attribute that is correlated with both.
The first choice for cross-attribute correlation is between the acid variables. Since they are all acids, it is predicted that they would correlate with each other, and the stacked scatter plots support this. The linear models for both of the plots involving citric acid have the steepest slopes, which suggests that citric acid is the acid most strongly related to the others, although a slope also depends on the scale of each variable, so correlation coefficients would be firmer evidence. Considering that volatile acidity is the acid that easily evaporates, it would make sense that as volatile acidity decreases, the amount of fixed acidity increases.
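The pairwise relationships among the acids can also be summarised numerically with a correlation matrix. The sketch below uses a synthetic data frame with the same column names as the wine data; the values and the strengths of the relationships are invented for illustration and are not results from the real data set.

```r
# Synthetic stand-in for the three acid columns (NOT the wine data).
set.seed(42)
n <- 500
fixed.acidity    <- rnorm(n, mean = 8.3, sd = 1.7)
citric.acid      <- 0.27 + 0.08 * (fixed.acidity - 8.3) + rnorm(n, sd = 0.12)
volatile.acidity <- 0.53 - 0.60 * (citric.acid - 0.27) + rnorm(n, sd = 0.13)
acids <- data.frame(fixed.acidity, citric.acid, volatile.acidity)

# Pairwise Pearson correlations in one call
acid_cor <- cor(acids)
print(round(acid_cor, 2))
```

Unlike the slope of a fitted line, a correlation coefficient does not change when a variable is rescaled, which makes the matrix a fairer basis for comparing how strongly the acids are related.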
##
## Pearson's product-moment correlation
##
## data: wine$density and wine$fixed.acidity
## t = 35.877, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.6399847 0.6943302
## sample estimates:
## cor
## 0.6680473
Despite the variation in density and fixed acidity, the plot exhibits an overall steady positive trend between density and fixed acidity (r = 0.67, p < 0.001). This is very reasonable since tartaric acid, the chemical in fixed acidity, has a density of 1790 kg/m^3, which is higher than that of water, which is 1000 kg/m^3. Adding a smoothing curve makes the positive trend easier to see. A straight line did not follow the curvature of the data well, so I used the default smoother to retain the shape of the trend.
##
## Pearson's product-moment correlation
##
## data: wine$density and wine$alcohol
## t = -22.838, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.5322547 -0.4583061
## sample estimates:
## cor
## -0.4961798
The plot shows that wines with alcohols of various levels could have densities of various levels, but in general, they have a negative correlation (r = -0.50, p < 0.001). This is reasonable since alcohol has a density of 789 kg/m^3, which is lower than 1000 kg/m^3, the density of water.
##
## Pearson's product-moment correlation
##
## data: wine$density and wine$residual.sugar
## t = 15.189, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.3116908 0.3973835
## sample estimates:
## cor
## 0.3552834
Adding a smoothing line makes the curved trend in the plot easier to see. Density increases steadily with residual sugar from 0 to 4 g/dm^3, but rises in a more wave-like fashion beyond that. The reason for this is unclear; it may be because there are many more data values below 4 g/dm^3 of residual sugar and far fewer at higher levels, which makes the smoother less stable there.
##
## Pearson's product-moment correlation
##
## data: wine$sulphates and wine$chlorides
## t = 15.978, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.3282127 0.4127694
## sample estimates:
## cor
## 0.3712605
## 95%
## 0.1261
## 95%
## 0.93
##
## Pearson's product-moment correlation
##
## data: sulphates and chlorides
## t = -2.0202, df = 1466, p-value = 0.04354
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.103573750 -0.001532528
## sample estimates:
## cor
## -0.05269068
Although there is a positive correlation (r = 0.37, p < 0.001) between chlorides and sulphates, the scatter plot does not show a clear trend. So I tried removing the top 5% of both variables and correlating only the values up to the 95th percentile. The correlation dropped to -0.05, which is negligible. This shows that without the extreme outliers, chlorides and sulphates do not actually correlate.
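The trimming step can be sketched as follows. The data here are synthetic: two unrelated variables with a few rows that are extreme on both, chosen to mimic how joint outliers can create a correlation on their own. The `<=` cut-off on both variables is an assumption, since the original filtering code is not shown.

```r
# Synthetic stand-in (NOT the wine data): two unrelated variables plus a
# handful of rows that are extreme on both, mimicking joint outliers.
set.seed(7)
n <- 1000
chlorides <- rnorm(n, mean = 0.08, sd = 0.01)
sulphates <- rnorm(n, mean = 0.65, sd = 0.10)
idx <- 1:20
chlorides[idx] <- chlorides[idx] + 0.30
sulphates[idx] <- sulphates[idx] + 0.80
d <- data.frame(chlorides, sulphates)

# Correlation on all rows, outliers included
r_all <- cor(d$chlorides, d$sulphates)

# Keep only rows at or below the 95th percentile of BOTH variables
keep <- d$chlorides <= quantile(d$chlorides, 0.95) &
        d$sulphates <= quantile(d$sulphates, 0.95)
trimmed <- d[keep, ]
r_trim <- cor(trimmed$chlorides, trimmed$sulphates)

print(round(c(all = r_all, trimmed = r_trim, rows_kept = nrow(trimmed)), 3))
```

Because Pearson's r is sensitive to extreme points, comparing the coefficient before and after trimming is a simple way to see whether a correlation is carried by the bulk of the data or by a few observations.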
After exploring the attributes and their relationship to each other, the data suggests that various attributes are correlated with each other. The acid attributes are all correlated. Citric acid and fixed acidity are positively correlated, while they both are negatively correlated with volatile acidity. Density is positively correlated with fixed acidity and negatively correlated with alcohol. This is expected because tartaric acid in fixed acidity has a higher density than water and alcohol has a lower density than water, so more acid would raise the density and more alcohol would lower it. Density is also positively correlated with residual sugar; however, the trend is much clearer for residual sugar levels below 4 g/dm^3, where there are more data points. At first, it would seem that sulphates and chlorides are correlated, but after plotting their relationship and removing the top 5% outliers, the data suggests that they are not correlated. From this cross-attribute investigation, I discovered that many of the wine attributes are correlated and care should be taken when ascribing quality contributions to any one particular wine attribute.
After seeing that many of the wine attributes are correlated, I created a multivariable analysis exploring the effect of a combination of attributes on quality. The first of these explorations is the combination of the attributes that correlated the most with quality: alcohol, volatile acidity, and sulphates. Afterwards, I explored a combination of the attributes that were described as having a flavor. Of these, volatile acidity was described as possibly having a “vinegar taste”, citric acid as adding a “freshness taste” and residual sugar as adding a “sweet” taste (Lopez et al., 2009). Lastly, I explored the effect of the physical property, density, and the chemical property, pH, on the quality of wines.
## 25%
## 0.09
## 50%
## 0.26
## 75%
## 0.42
## 100%
## 1
I wanted to facet-wrap the plot by citric acid, so I checked its distribution and cut it into quartiles so that each panel holds a similar number of data values.
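Cutting a continuous variable into quartile buckets can be sketched in base R as below. The `citric` vector is synthetic, and the use of `cut()` with `quantile()` break points is an assumption about the approach, since the original bucketing code is not shown.

```r
# Synthetic stand-in for wine$citric.acid (NOT the real data)
set.seed(3)
citric <- round(runif(400, 0, 1), 2)

# Break points at the minimum, the three quartiles, and the maximum
breaks <- quantile(citric, probs = c(0, 0.25, 0.5, 0.75, 1))

# include.lowest = TRUE keeps the minimum value inside the first bucket
bucket <- cut(citric, breaks = breaks, include.lowest = TRUE)

print(table(bucket))
```

Without `include.lowest = TRUE`, observations equal to the minimum would fall outside the first interval and become `NA`, which matters for a variable like citric acid where the minimum value is also the most common one.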
The two plots on the bottom have higher concentrations of higher quality wines. From the previous correlations, it is unsurprising that higher quality wines are correlated with higher levels of citric acid and lower levels of volatile acidity. However, it is interesting to see that although the two plots at the bottom seem to both have a decent amount of higher quality wines, their distribution spread is quite different. The bottom left plot is much more concentrated and the bottom right plot is more spread out.
## 25%
## 0.9956
## 50%
## 0.99675
## 75%
## 0.997835
## 100%
## 1.00369
To facet by density, I checked its distribution; the values are all very close together, ranging from about 0.99 to 1.00 g/cm^3. This is unsurprising since wine should be near the density of water. The density was then cut into quartile buckets.
The data points in the plot are centered and relatively scattered. The higher quality wines are also relatively scattered without too much visible correlation. However, it is interesting that at densities below 0.992 g/cm^3, there are only higher quality wines (7-8 quality).
From the multivariable analysis of the combinations of the top correlated attributes, taste attributes, and property attributes, it is evident that wine quality is affected by many wine attributes and not just a single one. The scattered variation of wine quality in the taste attributes plot and the density and pH plot implies that wine of high quality can have various quantities of the different attributes. This nuanced combination of attributes makes the quality of wine harder to predict.
This exploration compares the amount of residual sugar and total sulfur dioxide between wines of low quality (quality 3 and 4) and wines of high quality (quality 7 and 8) to find if there are notable differences between low and high quality wines. The wines are cut into alcohol buckets to compare wines of similar types.
Wines of low quality have varying levels of residual sugar across alcohol buckets, whereas wines of high quality tend to have consistent levels of sugar across all alcohol buckets.
There is a similar trend with total sulfur dioxide where the low quality wines tend to have varying levels of total sulfur dioxide and the high quality wines tend to be more consistent.
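One way to put a number on "consistency" is to compare the spread of an attribute within each quality group and alcohol bucket. The sketch below does this with the interquartile range on a synthetic data frame; the group labels, bucket width, and values are all invented for illustration and do not reproduce the wine data.

```r
# Synthetic stand-in (NOT the wine data): same average sugar in both
# groups, but the low-quality group is far more variable.
set.seed(11)
n <- 200
toy <- data.frame(
  group = rep(c("low (3-4)", "high (7-8)"), each = n),
  alcohol = runif(2 * n, 9, 13),
  residual.sugar = c(rnorm(n, mean = 2.5, sd = 1.2),
                     rnorm(n, mean = 2.5, sd = 0.4))
)

# Alcohol buckets of width 1 (% by volume)
toy$alcohol.bucket <- cut(toy$alcohol, breaks = 9:13, include.lowest = TRUE)

# Interquartile range of sugar within each quality group and alcohol bucket
spread <- tapply(toy$residual.sugar,
                 list(toy$group, toy$alcohol.bucket), IQR)
print(round(spread, 2))
```

The interquartile range corresponds to the height of each box in a boxplot, so a table like this states numerically what the side-by-side boxplots show visually.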
Despite wine quality being a complicated number to predict exactly, there is enough data and information on it to make a rough predictive model.
A predictive model using linear regression is built to predict the quality of wine. The model is built using the wine attributes that have the highest correlation with quality: alcohol, volatile acidity, sulphates, density, citric acid, and chlorides.
##
## Calls:
## model1: lm(formula = I(quality) ~ I(alcohol), data = wine)
## model2: lm(formula = I(quality) ~ I(alcohol) + volatile.acidity, data = wine)
## model3: lm(formula = I(quality) ~ I(alcohol) + volatile.acidity + sulphates,
## data = wine)
## model4: lm(formula = I(quality) ~ I(alcohol) + volatile.acidity + sulphates +
## citric.acid, data = wine)
## model5: lm(formula = I(quality) ~ I(alcohol) + volatile.acidity + sulphates +
## citric.acid + density, data = wine)
## model6: lm(formula = I(quality) ~ I(alcohol) + volatile.acidity + sulphates +
## citric.acid + density + chlorides, data = wine)
##
## ========================================================================================
## model1 model2 model3 model4 model5 model6
## ----------------------------------------------------------------------------------------
## (Intercept) 1.875*** 3.095*** 2.611*** 2.646*** -12.504 -7.209
## (0.175) (0.184) (0.196) (0.201) (11.964) (11.982)
## I(alcohol) 0.361*** 0.314*** 0.309*** 0.309*** 0.323*** 0.301***
## (0.017) (0.016) (0.016) (0.016) (0.019) (0.020)
## volatile.acidity -1.384*** -1.221*** -1.265*** -1.301*** -1.182***
## (0.095) (0.097) (0.113) (0.116) (0.120)
## sulphates 0.679*** 0.696*** 0.680*** 0.857***
## (0.101) (0.103) (0.104) (0.112)
## citric.acid -0.079 -0.155 -0.031
## (0.104) (0.120) (0.123)
## density 15.106 9.946
## (11.927) (11.941)
## chlorides -1.627***
## (0.407)
## ----------------------------------------------------------------------------------------
## R-squared 0.2 0.3 0.3 0.3 0.3 0.3
## adj. R-squared 0.2 0.3 0.3 0.3 0.3 0.3
## sigma 0.7 0.7 0.7 0.7 0.7 0.7
## F 468.3 370.4 268.9 201.8 161.8 138.8
## p 0.0 0.0 0.0 0.0 0.0 0.0
## Log-likelihood -1721.1 -1621.8 -1599.4 -1599.1 -1598.3 -1590.3
## Deviance 805.9 711.8 692.1 691.9 691.2 684.3
## AIC 3448.1 3251.6 3208.8 3210.2 3210.6 3196.6
## BIC 3464.2 3273.1 3235.7 3242.4 3248.2 3239.6
## N 1599 1599 1599 1599 1599 1599
## ========================================================================================
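The table shows the intercepts and coefficients (with standard errors in parentheses) of the linear models that are used to predict the quality of the wine based on the wine attributes. An important aspect of this table is the R-squared value, which is the proportion of the variance in wine quality that is explained by the model. In the table above, R-squared is about 0.3 for models 2-6, which means about 30% of the variance in the quality rating is explained by the model. This value is not high, but it does show that the attributes carry some predictive signal. Note that once alcohol, volatile acidity, and sulphates are in the model, citric acid and density add little: their coefficients are not statistically significant.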
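The way R-squared behaves as predictors are added can be sketched with a few nested models. The data frame below is synthetic, with column names borrowed from the wine data and arbitrary coefficients, so the printed values illustrate the mechanics only and say nothing about the real wines.

```r
# Synthetic stand-in (NOT the wine data); coefficients are arbitrary.
set.seed(5)
n <- 800
toy <- data.frame(alcohol = rnorm(n, 10.4, 1.1),
                  volatile.acidity = rnorm(n, 0.53, 0.18),
                  sulphates = rnorm(n, 0.66, 0.17))
toy$quality <- 2.6 + 0.3 * toy$alcohol - 1.2 * toy$volatile.acidity +
  0.7 * toy$sulphates + rnorm(n, sd = 0.65)

# Nested models: each adds one predictor to the previous one
m1 <- lm(quality ~ alcohol, data = toy)
m2 <- update(m1, . ~ . + volatile.acidity)
m3 <- update(m2, . ~ . + sulphates)

# R-squared: share of the variance in quality accounted for by each model
r2 <- sapply(list(m1 = m1, m2 = m2, m3 = m3),
             function(m) summary(m)$r.squared)
print(round(r2, 3))

# AIC penalises extra terms, so it can rise when a predictor adds little
print(round(sapply(list(m1 = m1, m2 = m2, m3 = m3), AIC), 1))
```

R-squared can never decrease when a predictor is added to a least-squares model, which is why penalised measures such as AIC or BIC are the more informative rows when comparing models of different sizes.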
## [1] 0.6543859
In the residual plot, I used alcohol as the independent predictor of wine quality since it had the highest correlation with quality. The plot shows that most of the model's predictions fall within about 0.5 to 1.0 points of the actual quality rating.
For a number as complicated as wine quality, the predictive model does a decent job of predicting the quality to within 0.5 to 1.0 points of the rating. The predictive model is able to predict this number due to the many constraints that are given by quality’s correlation with the various attributes. For example, according to the data earlier, it is suggested that all wines below density 0.992 g/cm^3 have a quality of 7 or 8. By using the constraints supplied by the wine attributes, the quality model is able to produce a general estimate of quality.
The quality ratings are heavily concentrated in the mid-range. The quality displays a Gaussian curve, with most of the wines having a rating of 5 or 6. Very few of the wines have a rating of 3, 4, 7, or 8, and no wines have a rating below 3 or above 8.
In this plot, the wine attributes most highly correlated with quality are plotted together: alcohol, volatile acidity, and sulphates. From left to right, the wine quality increases and the spread becomes more concentrated. As the level of sulphates increases, the wine quality also increases and the wines tend to be more concentrated towards higher alcohol and lower volatile acidity.
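The value 0.654 printed before the residual plot is consistent with the standard deviation of the residuals of the largest model, which can be checked from the deviance reported in the model table. This is a reconstruction from the published figures, not the original call, which is not shown.

```r
# Figures taken from the model6 column of the model table
deviance_m6 <- 684.3   # residual sum of squares
n_obs       <- 1599

# Sample standard deviation of the residuals (they average zero when the
# model has an intercept), using the usual n - 1 denominator
resid_sd <- sqrt(deviance_m6 / (n_obs - 1))
print(round(resid_sd, 3))

# Rough guide, assuming approximately normal residuals: about two thirds of
# predictions fall within one such standard deviation of the observed
# rating, and nearly all within two.
print(round(c(one_sd = resid_sd, two_sd = 2 * resid_sd), 2))
```

Since quality is recorded in whole numbers, the normal approximation is only rough, but it agrees with the reading of the residual plot that typical errors are between half a point and one point.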
In this visualization, wine attributes alcohol, residual sugar and total sulfur dioxide are used to compare the difference between wines of higher quality with wines of lower quality. By arranging the plots with low quality wines (ratings 3-4) on the left and high quality wines (ratings 7-8) on the right, it is clear to see that high quality wines tend to be much more consistent than low quality wines. Low quality wines have varying levels of residual sugar and total sulfur dioxide. High quality wines have similar levels of residual sugar and total sulfur dioxide across most of their wines, tending to be much more consistent with their compositions.
The wine data set contains 1599 observations on 11 attributes of wine and wine quality. To start my investigation, I wanted to figure out which attributes affected wine quality the most and which ones were interesting to explore. I started my investigation with a matrix plot to get an overview of the data and the relationship between the variables. This was a great success, as the matrix plot helped me plan out my investigation. By looking at the matrix plot, I was able to see which variables had interesting trends to investigate. For example, I found out that the variable citric acid has intermittent peaks. I was also quickly able to find the wine attributes most highly correlated with quality. Thus, I decided to do a bivariate analysis on quality and alcohol, volatile acidity, sulphates, and citric acid. The matrix plot also helped me see that the different wine attributes were correlated with each other, which spurred my analysis into cross-attribute correlation.
The more I continued to explore wine variables, the more complicated it became to find which variables contributed to wine quality. Many of the wine attributes were weakly to moderately correlated with quality; however, many of the attributes were also correlated with each other. It became difficult to decipher which attributes were really contributing to wine quality and how much they were contributing, rather than reflecting a third-variable effect. Without having more data, I could only infer the relationship between the variables.
A breakthrough was when I used boxplots to plot higher quality wines compared to lower quality wines and discovered that higher quality wines were much more consistent in their composition. This leads me to the idea that, although there is a lot of variation among high quality wines and low quality wines across all of the wine attributes, it is the consistency of a wine's composition that makes it higher quality.
Although this data set contained enough observations, most of the observations were for wine quality 5 and 6. There were no observations for wines below quality 3 or above quality 8, although the quality is scored from 0 to 10. Without information on almost half of the possible quality ratings, the predictive power of this data set is limited. Further investigations should include wines with qualities ranging from 0 to 10. In addition, this data set was collected from Portuguese “Vinho Verde” wine, which may limit its generalization to all types of red wine. Further investigation could use a data set with a wider variety of red wines.